AITopics | parallel corpora

parallel corpora

Automated Snippet-Alignment Data Augmentation for Code Translation

Zhang, Zhiming, Zhu, Qingfu, Luo, Xianzhen, Wang, Yixuan, Li, Bohan, Che, Wanxiang

arXiv.org Artificial Intelligence, Oct-20-2025

Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code translation, and parallel corpora play a crucial role in training models for code translation. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data has complete context and is suitable for semantic alignment learning, it may not provide adequate fine-grained training signals due to its extended length, while the brevity of SA data enables more fine-grained alignment learning. Due to limited parallel corpora, researchers explore several augmentation methods for code translation. Previous studies mainly focus on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To fully leverage both PA data and SA data, we explore a simple yet effective two-stage training strategy, which consistently enhances model performance compared to fine-tuning solely on PA data. Experiments on TransCoder-test demonstrate that our augmented SA data combined with the two-stage training approach yields consistent improvements over the baseline, achieving a maximum gain of 3.78% on pass@k.
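The abstract reports gains on pass@k without defining the metric. As a point of reference, here is a minimal sketch of the widely used unbiased pass@k estimator, where `n` samples are generated per problem and `c` of them pass the tests. This is an assumption: the listing does not state which estimator the authors use, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k as 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem
    c: samples that pass all tests
    k: number of attempts allowed
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `pass_at_k` over all problems.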

large language model, natural language, sa data, (15 more...)

arXiv: 2510.15004

Country:

North America > United States (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry: Automobiles & Trucks > Parts Supplier (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)

CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems

Bhattacharjee, Soham, Roy, Mukund K, Poojary, Yathish, Dave, Bhargav, Raj, Mihir, Mujadia, Vandan, Gain, Baban, Mishra, Pruthwik, Ahsan, Arafat, Krishnamurthy, Parameswari, Rao, Ashwath, Josan, Gurpreet Singh, Dubey, Preeti, Kak, Aadil Amin, Kulkarni, Anna Rao, VG, Narendra, Arora, Sunita, Balbantray, Rakesh, Majumdar, Prasenjit, Arora, Karunesh K, Ekbal, Asif, Sharma, Dipti Mishra

arXiv.org Artificial Intelligence, Sep-25-2025

India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages, with 22 officially recognized as scheduled languages in the Indian Constitution. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce, especially across varied domains. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 of these languages: English, Telugu, Hindi, Punjabi, Odia, Kashmiri, Sindhi, Dogri, Kannada, Urdu, and Gujarati, comprising a total of 772,000 bi-text sentence pairs. The dataset is carefully curated and systematically categorized into three key domains: Government, Health, and General, to enable domain-aware machine translation research and facilitate effective domain adaptation. To demonstrate the utility of CorIL and establish strong benchmarks for future research, we fine-tune and evaluate several state-of-the-art NMT models, including IndicTrans2, NLLB, and BhashaVerse. Our analysis reveals important performance trends and highlights the corpus's value in probing model capabilities. For instance, the results show distinct performance patterns based on language script, with massively multilingual models showing an advantage on Perso-Arabic scripts (Urdu, Sindhi) while other models excel on Indic scripts. This paper provides a detailed domain-wise performance analysis, offering insights into domain sensitivity and cross-script transfer learning. By publicly releasing CorIL, we aim to significantly improve the availability of high-quality training data for Indian languages and provide a valuable resource for the machine translation research community.

artificial intelligence, natural language, translation, (18 more...)

arXiv: 2509.19941

Country:

Asia > Pakistan (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Iceland > Capital Region > Reykjavik (0.04)
(18 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Government > Regional Government > Asia Government > India Government (0.54)
Law > Intellectual Property & Technology Law (0.46)
Health & Medicine > Consumer Health (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan

arXiv.org Artificial Intelligence, Sep-23-2025

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.

artificial intelligence, large language model, natural language, (15 more...)

arXiv: 2509.16914

Country:

Asia > China > Beijing > Beijing (0.40)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations

Lu, Qiuyang, Shen, Fangjian, Tang, Zhengkai, Liu, Qiang, Cheng, Hexuan, Liu, Hui, Wen, Wushao

arXiv.org Artificial Intelligence, Sep-22-2025

The quality and accessibility of multilingual datasets are crucial for advancing machine translation. However, previous corpora built from United Nations documents have suffered from issues such as opaque processes, difficulty of reproduction, and limited scale. To address these challenges, we introduce a complete end-to-end solution, from data acquisition via web scraping to text alignment. The entire process is fully reproducible, with a minimalist single-machine example and optional distributed computing steps for scalability. At its core, we propose a new Graph-Aided Paragraph Alignment (GAPA) algorithm for efficient and flexible paragraph-level alignment. The resulting corpus contains over 713 million English tokens, more than doubling the scale of prior work. To the best of our knowledge, this represents the largest publicly available parallel corpus composed entirely of human-translated, non-AI-generated content. Our code and corpus are accessible under the MIT License.

artificial intelligence, natural language, paragraph, (15 more...)

arXiv: 2509.15789

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Middle East > Malta (0.04)
(4 more...)

Genre: Research Report (0.40)

Industry: Government > Intergovernmental Programs (0.63)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

The TUB Sign Language Corpus Collection

Avramidis, Eleftherios, Czehmann, Vera, Deckert, Fabian, Hufe, Lorenz, Lipski, Aljoscha, Villalobos, Yuni Amaloa Quintero, Rhee, Tae Kwon, Shi, Mengqian, Stölting, Lennart, Nunnari, Fabrizio, Möller, Sebastian

arXiv.org Artificial Intelligence, Aug-8-2025

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3M subtitles containing 14M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, whereas the size of the German Sign Language corpora is ten times the size of the previously available corpora. The collection was created by collecting and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

artificial intelligence, machine translation, natural language, (15 more...)

arXiv: 2508.05374

Country:

North America > Mexico (0.14)
Europe > Germany > Berlin (0.07)
South America > Colombia (0.05)
(18 more...)

Genre: Research Report (0.40)

Industry: Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)

Building and Aligning Comparable Corpora

Saad, Motaz, Langlois, David, Smaili, Kamel

arXiv.org Artificial Intelligence, Aug-5-2025

A comparable corpus is a set of topic-aligned documents in multiple languages, which are not necessarily translations of each other. These documents are useful for multilingual natural language processing when there is no parallel text available in some domains or languages. In addition, comparable documents are informative because they can tell what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two cross-lingual similarity measures to align comparable documents. The first measure is based on a bilingual dictionary, and the second measure is based on Latent Semantic Indexing (LSI). Experiments on several corpora show that the Cross-Lingual LSI (CL-LSI) measure outperforms the dictionary-based measure. Finally, we collect English and Arabic news documents from the British Broadcasting Corporation (BBC) and from the ALJAZEERA (JSC) news website respectively. Then we use the CL-LSI similarity measure to automatically align comparable documents of BBC and JSC. The evaluation of the alignment shows that CL-LSI is able to align cross-lingual documents not only at the topic level but also at the event level.
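To make the dictionary-based measure concrete, here is a minimal sketch of one way such an alignment could work. It is not the authors' method: the abstract does not specify the formula, so the Jaccard overlap, the function names, and the toy dictionary below are all illustrative assumptions.

```python
def dictionary_similarity(src_tokens, tgt_tokens, bilingual_dict):
    """Jaccard overlap between dictionary translations of the source
    tokens and the target tokens (an assumed, simplified measure)."""
    translated = set()
    for token in src_tokens:
        # Tokens missing from the dictionary contribute nothing.
        translated.update(bilingual_dict.get(token, ()))
    target = set(tgt_tokens)
    union = translated | target
    return len(translated & target) / len(union) if union else 0.0


def align(src_docs, tgt_docs, bilingual_dict):
    """Map each source document id to its most similar target document id."""
    return {
        src_id: max(
            tgt_docs,
            key=lambda tgt_id: dictionary_similarity(
                src_tokens, tgt_docs[tgt_id], bilingual_dict
            ),
        )
        for src_id, src_tokens in src_docs.items()
    }


# Hypothetical toy data, French to English.
toy_dict = {"chat": ["cat"], "chien": ["dog"], "noir": ["black"]}
french = {"fr1": ["chat", "noir"], "fr2": ["chien"]}
english = {"en1": ["dog", "barks"], "en2": ["black", "cat"]}
alignment = align(french, english, toy_dict)
```

A measure like this depends entirely on dictionary coverage, which is one plausible reason a latent-space measure such as CL-LSI could do better.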

data mining, machine learning, natural language, (21 more...)

arXiv: 2508.02555

Country:

Europe > France (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Bulgaria (0.04)
(15 more...)

Genre: Research Report (1.00)

Industry: Media > News (0.93)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Edeflip: Supervised Word Translation between English and Yoruba

Abioye, Ikeoluwa, Ge, Jiani

arXiv.org Artificial Intelligence, Jun-17-2025

In recent years, embedding alignment has become the state-of-the-art machine translation approach, as it can yield high-quality translation without training on parallel corpora. However, existing research and application of embedding alignment mostly focus on high-resource languages with high-quality monolingual embeddings. It is unclear if and how low-resource languages may similarly benefit. In this study, we implement an established supervised embedding alignment method for word translation from English to Yoruba, the latter a low-resource language. We found that higher embedding quality and normalizing embeddings increase word translation precision, with an additional interaction effect between the two. Our results demonstrate the limitations of state-of-the-art supervised embedding alignment when it comes to low-resource languages, for which there are additional factors that need to be taken into consideration, such as the importance of curating high-quality monolingual embeddings. We hope our work will be a starting point for further machine translation research that takes into account the challenges that low-resource languages face.

artificial intelligence, natural language, translation, (17 more...)

arXiv: 2506.1302

Country:

Europe > Italy > Tuscany > Florence (0.04)
Asia > Indonesia > Bali (0.04)
Africa > West Africa (0.04)
Africa > Niger (0.04)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review

Raja, Rahul, Vats, Arpita

arXiv.org Artificial Intelligence, Mar-2-2025

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora in terms of alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

corpora, dataset, translation, (14 more...)

arXiv: 2503.04797

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Indonesia > Bali (0.04)
(30 more...)

Genre: Overview (1.00)

Industry:

Education (0.67)
Government (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Fernando, Aloka, Ranathunga, Surangika, de Silva, Nisansa

arXiv.org Artificial Intelligence, Feb-26-2025

Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems on the top-ranked samples, produces better NMT performance than training on the full dataset. However, previous research has shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En→Si, En→Ta and Si→Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
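The ranking step described in the abstract can be sketched as follows: score each sentence pair by the cosine similarity of its two sentence embeddings and keep the highest-scoring pairs. This is a simplified illustration under stated assumptions; the short vectors stand in for real multiPLM embeddings, and the function names and the `keep` parameter are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_ranked_pairs(pairs, src_vecs, tgt_vecs, keep):
    """Return the `keep` sentence pairs whose embeddings are most similar."""
    scored = sorted(
        zip(pairs, src_vecs, tgt_vecs),
        key=lambda item: cosine(item[1], item[2]),
        reverse=True,
    )
    return [pair for pair, _, _ in scored[:keep]]


# Hypothetical toy data: three candidate pairs with 2-d stand-in embeddings.
pairs = [("a", "x"), ("b", "y"), ("c", "z")]
src_vecs = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
tgt_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
selected = top_ranked_pairs(pairs, src_vecs, tgt_vecs, keep=2)
```

Because the ranking relies only on the embedding score, any systematic bias in the encoder carries straight through to the selected subset, which is the failure mode the paper's heuristics target.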

corpora, language pair, multiplm, (16 more...)

arXiv: 2502.19074

Country:

Asia > Sri Lanka (0.04)
Oceania > New Zealand > North Island > Manawatū-Whanganui > Palmerston North (0.04)
Europe > Belgium (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Comparable Corpora: Opportunities for New Research Directions

Church, Kenneth

arXiv.org Artificial Intelligence, Jan-24-2025

Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research. This was a keynote at BUCC-2025, a workshop associated with Coling-2025.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv: 2501.14721

Country:

Asia > China > Hong Kong (0.05)
North America > United States > New York (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
(12 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Communications (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
